Views

Note

View functions are currently in beta and are exported from the inspect_viz.view.beta module. The beta module will be preserved after the final release so that code written against it now will continue to work after the beta.

Scores by Task

The scores_by_task() function renders a bar plot for comparing eval scores.

from inspect_viz import Data
from inspect_viz.view.beta import scores_by_task

evals = Data.from_file("evals.parquet")
scores_by_task(evals)

Data Preparation

The scores by task plot is intended to operate on data read directly from an Inspect log directory with the evals_df() function.
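For example, a minimal sketch of this preparation, assuming your logs live in a local logs directory:

from inspect_ai.analysis.beta import evals_df

# read eval-level data from a log directory (path is illustrative)
df = evals_df("logs")

# write to parquet for loading with Data.from_file()
df.to_parquet("evals.parquet")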

Options

By default, scores (y) are plotted by “task_name” (fx) and “model” (x), with confidence intervals also shown (disable these with y_ci=False). See the scores_by_task() reference for further details on available options.
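For example, to render the plot without confidence intervals:

from inspect_viz import Data
from inspect_viz.view.beta import scores_by_task

evals = Data.from_file("evals.parquet")
scores_by_task(evals, y_ci=False)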

Scores Timeline

The scores_timeline() function plots eval scores by model, organization, and release date.

from inspect_viz import Data
from inspect_viz.view.beta import scores_timeline

evals = Data.from_file("benchmarks.parquet")
scores_timeline(evals)

Data Preparation

The scores timeline plot expects Data with the following fields (the sketch after this list shows one way to assemble such a file):

  • model: Model name (e.g. “gpt-4o”).
  • organization: Organization that created the model (e.g. “OpenAI”).
  • release_date: Date of model release.
  • eval: Name of eval (e.g. “SWE-bench Verified”).
  • scorer: Scorer used (e.g. “choice”).
  • score: Benchmark score (scaled 0-1).
  • stderr: Standard error.
  • log_viewer: Optional URL to view the evaluation log.
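As a minimal sketch, one way to assemble such a file with pandas (the values shown are placeholders, not real benchmark results):

import pandas as pd

# placeholder row illustrating the expected fields
benchmarks = pd.DataFrame([
    {
        "model": "gpt-4o",
        "organization": "OpenAI",
        "release_date": "2024-05-13",
        "eval": "SWE-bench Verified",
        "scorer": "choice",
        "score": 0.5,        # placeholder value
        "stderr": 0.01,      # placeholder value
        "log_viewer": None,  # optional
    }
])

benchmarks.to_parquet("benchmarks.parquet")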

Options

Use the organizations option to customize the list of organizations to include (in order of desired presentation). Use the ci parameter to set a confidence interval (or disable confidence intervals). See the scores_timeline() reference for further details on available options.
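For example (the organization list and interval value here are illustrative):

from inspect_viz import Data
from inspect_viz.view.beta import scores_timeline

evals = Data.from_file("benchmarks.parquet")

# order organizations explicitly and use a 90% confidence interval
scores_timeline(
    evals,
    organizations=["OpenAI", "Anthropic", "Google"],
    ci=0.90,
)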

Scores with Baseline

The scores_with_baseline() function creates a horizontal bar plot for comparing the scores of different models on a single evaluation, with one or more baselines overlaid as vertical lines.

from inspect_viz import Data
from inspect_viz.view.beta import scores_with_baseline, Baseline

evals = Data.from_file("gpqa_diamond.parquet")
scores_with_baseline(evals, baseline=0.697)

Data Preparation

The scores with baseline plot is intended to operate on data read directly from an Inspect log directory with the evals_df() function.
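For example, a minimal sketch that prepares data for a single evaluation (the log directory path is illustrative):

from inspect_ai.analysis.beta import evals_df

# read evals for a single evaluation from a log directory
df = evals_df("logs/gpqa-diamond")
df.to_parquet("gpqa_diamond.parquet")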

Options

You can specify one or more Baseline definitions using the baseline option. You can customize the sort order using the sort option (defaults to descending). See the scores_with_baseline() reference for further details on available options.
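For example, passing a labeled baseline rather than a bare score (the Baseline arguments shown here are illustrative; consult the reference for its actual signature):

from inspect_viz import Data
from inspect_viz.view.beta import scores_with_baseline, Baseline

evals = Data.from_file("gpqa_diamond.parquet")

# overlay a labeled baseline (arguments are illustrative)
scores_with_baseline(
    evals,
    baseline=Baseline(0.697, label="Human expert"),
)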

Tool Calls

The tool_calls() function creates a heat map visualizing tool calls over evaluation turns.

from inspect_viz import Data
from inspect_viz.view.beta import tool_calls

tools = Data.from_file("cybench_tools.parquet")
tool_calls(tools)

Data Preparation

To create the plot we read a raw messages data frame from an eval log using the messages_df() function, then filter down to just the fields we require for visualization:

from inspect_ai.analysis.beta import messages_df, MessageColumns, SampleSummary

# read messages from log
log = "<path-to-log>.eval"
df = messages_df(log, columns=SampleSummary + MessageColumns)

# trim columns
tools_df = df[[
    "eval_id",
    "id",
    "order",
    "tool_call_function",
    "limit"
]]

Note that trimming columns is particularly important: Inspect Viz embeds datasets directly in the web pages that host them, so we want to minimize their size for page load performance and bandwidth usage.
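Finally, write the trimmed data frame to a parquet file so it can be loaded with Data.from_file() as shown above:

# write the trimmed data frame for use with Data.from_file()
tools_df.to_parquet("cybench_tools.parquet")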

Options

Use the tools option to customize the list of tools to include (in order of desired presentation). See the tool_calls() reference for further details on available options.
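For example (the tool names here are illustrative):

from inspect_viz import Data
from inspect_viz.view.beta import tool_calls

tools_data = Data.from_file("cybench_tools.parquet")

# restrict and order the tools shown in the heat map
tool_calls(tools_data, tools=["bash", "python", "submit"])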